Question: Should Iowa have won the NCAA National Championship for women’s basketball?
Iowa Dataset
library(tidyverse)
library(readr)
library(here)
library(rvest)
# Iowa
url2 <- "https://www.espn.com/womens-college-basketball/team/stats/_/id/2294/iowa-hawkeyes"
h2 <- read_html(url2)
tab2 <- h2 |> html_nodes("table")
# stats with no names
iowa_df <- tab2[[4]] |> html_table(fill = TRUE)
# names table
iowa_df2 <- tab2[[3]] |> html_table(fill = TRUE)
iowa_stats <- bind_cols(iowa_df2, iowa_df)

# South Carolina
url3 <- "https://www.espn.com/womens-college-basketball/team/stats/_/id/2579/south-carolina-gamecocks"
h3 <- read_html(url3)
tab3 <- h3 |> html_nodes("table")
# stats with no names
sc_df <- tab3[[4]] |> html_table(fill = TRUE)
# names table
sc_df2 <- tab3[[3]] |> html_table(fill = TRUE)
sc_stats <- bind_cols(sc_df2, sc_df)
The data sets above hold season statistics for two women’s NCAA basketball teams, the Iowa Hawkeyes and the South Carolina Gamecocks.
These data sets include the following variables:
Name: Name of the player
MIN: Minutes Per Game
FGM: Field Goals Made
FGA: Field Goals Attempted
FTM: Free Throws Made
FTA: Free Throws Attempted
3PM: 3-Pointers Made
3PA: 3-Pointers Attempted
PTS: Points
OR: Offensive Rebounds
DR: Defensive Rebounds
REB: Rebounds (Offensive and Defensive Total)
AST: Assists
TO: Turnovers
STL: Steals
BLK: Blocks
First, let’s compare points between the teams. To keep the comparison even, we will only look at the top ten players by minutes played (MIN) from each team, because the two rosters have different numbers of players. The same restriction applies to this table and the following few tables.
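library(pander)

# Average season points among the ten players with the most minutes
summary_iowa <- iowa_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(points_avg = mean(PTS))

summary_south_carolina <- sc_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(points_avg = mean(PTS))

# Build the table first, then print it, so combined_summary keeps the data
combined_summary <- bind_rows(
  mutate(summary_iowa, Team = "Iowa"),
  mutate(summary_south_carolina, Team = "South Carolina")
)
pander(combined_summary)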
From this table, we see that Iowa’s top ten players average 27.4 more season points each than South Carolina’s. To analyze this number further, I want to look at Field Goals Made, because the number of points doesn’t tell us much on its own.
Now let’s look at Field Goals Made for each team. The difference between FGM and PTS is that FGM is the count of baskets made from the field (two-pointers and three-pointers, not free throws), while PTS is the total number of points, with each made shot counted at its point value (1 for a free throw, 2 for a basket from inside the arc, 3 for a basket from beyond the arc).
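To make the relationship between the two columns concrete, here is a minimal sketch of how points follow from the box-score counts. The stat line is made up for illustration and does not describe any real player; the only assumption is the usual box-score convention that FGM already includes three-pointers.

```r
# Made-up stat line for one hypothetical player (not real data).
fgm <- 8  # field goals made, two- and three-pointers combined
tpm <- 3  # three-pointers made (already counted inside fgm)
ftm <- 4  # free throws made

# Every field goal is worth at least 2 points, each three-pointer adds
# 1 more point on top of that, and each free throw is worth 1 point.
pts <- 2 * fgm + tpm + ftm
print(pts)  # 23
```

Two players with the same FGM can therefore have different PTS if one of them makes more threes or more free throws.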
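# Average field goals made among the ten players with the most minutes
summary_iowa2 <- iowa_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(fg_made = mean(FGM))

summary_south_carolina2 <- sc_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(fg_made = mean(FGM))

# Build the table first, then print it, so combined_summary2 keeps the data
combined_summary2 <- bind_rows(
  mutate(summary_iowa2, Team = "Iowa"),
  mutate(summary_south_carolina2, Team = "South Carolina")
)
pander(combined_summary2)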
From this table, we can see that this time South Carolina has the higher number, but not by much (less than 1). This tells us that although Iowa has a higher points average, the two teams are pretty similar when it comes to average field goals actually made. (This compares makes only; accuracy in the strict sense would also need attempts, FGA.) One explanation for the first table could be that Iowa scores more high-value baskets, so I want to look at that next.
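For reference, shooting accuracy is usually expressed as field goal percentage, FGM divided by FGA. The tables in this analysis do not compute it; the sketch below only shows the arithmetic, using made-up totals for two hypothetical teams.

```r
# Made-up team totals for illustration only (not real ESPN numbers).
fgm <- c("Team A" = 300, "Team B" = 310)  # field goals made
fga <- c("Team A" = 600, "Team B" = 640)  # field goals attempted

# Field goal percentage: share of attempts that went in, as a percent.
fg_pct <- round(100 * fgm / fga, 1)
print(fg_pct)
```

In this toy case Team B makes more field goals but Team A is the more accurate team, which is why a count of makes alone cannot settle the accuracy question.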
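# Average three-pointers made among the ten players with the most minutes
summary_iowa3 <- iowa_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(`3s_made` = mean(`3PM`))

summary_south_carolina3 <- sc_stats |>
  arrange(desc(MIN)) |>
  slice(1:10) |>
  summarize(`3s_made` = mean(`3PM`))

# Build the table first, then print it, so combined_summary3 keeps the data
combined_summary3 <- bind_rows(
  mutate(summary_iowa3, Team = "Iowa"),
  mutate(summary_south_carolina3, Team = "South Carolina")
)
pander(combined_summary3)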
From this, we can see something that I suspected from the previous tables: Iowa has made almost twice as many three-pointers as South Carolina. This suggests that the points average shows a bigger margin between the teams because Iowa simply scores higher-value baskets more often than South Carolina.
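The three tables so far repeat the same steps: sort by MIN, keep the top ten players, and average one column. A minimal sketch of how those steps could be wrapped in one function follows. The helper `top_k_mean()` and the toy data frame are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: mean of one stat over the k players with the most minutes.
top_k_mean <- function(stats, stat, k = 10) {
  stats |>
    arrange(desc(MIN)) |>
    slice_head(n = k) |>
    summarize(value = mean({{ stat }})) |>
    pull(value)
}

# Made-up data for illustration only (not real ESPN numbers).
toy <- tibble(
  Name = c("A", "B", "C", "D"),
  MIN  = c(900, 850, 300, 100),
  PTS  = c(400, 300, 100, 20)
)

# Average points of the two players with the most minutes: (400 + 300) / 2
top2_points <- top_k_mean(toy, PTS, k = 2)
print(top2_points)  # 350
```

With the real data, a call such as `top_k_mean(iowa_stats, PTS)` would be expected to reproduce the Iowa value in the first table, assuming the scraped columns are numeric.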
Next, out of curiosity after seeing the first few numbers, I want to compare the teams by each player’s total points for the season, to see whether any outliers on either team are skewing these averages.
(Instead of looking at the top ten players by minutes, we will look at all the players on each team to get a better comparison as a whole.)
library(plotly)
Analyzing these two plots, we see something really interesting. Right off the bat, South Carolina looks like it has more even scoring between players, with a smooth decreasing trend from the top scorer. Iowa, on the other hand, seems to have an outlier right at the top. Caitlin Clark (1234) has 724 more points than the next best scorer on the team (510), and that gap alone is more than the season total of South Carolina’s top scorer (474).
This clears up the grey area we had with the average points comparison between the teams. Caitlin Clark is an obvious outlier here, even when looking at both teams together.
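The outlier call above is made by eye from the plots. One common way to check such a call numerically is the 1.5 × IQR rule, sketched below. This rule is an assumption added for illustration, not a method the analysis used, and the point totals are made up rather than taken from either roster.

```r
# Made-up season point totals for one hypothetical roster (not real data).
pts <- c(1000, 450, 400, 350, 300, 250, 200, 150, 100, 50)

# Upper fence: third quartile plus 1.5 times the interquartile range.
q <- quantile(pts, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Any total above the fence is flagged as a high outlier.
outliers <- pts[pts > upper_fence]
print(outliers)
```

Only the top value is flagged in this toy roster, which mirrors the pattern described for Iowa: one scorer far above an otherwise gradual decline.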